Feat/temporal filters in rag pipeline #284

Conversation

📝 Walkthrough

Adds a prompt-eval benchmarking suite (datasets, reports, prompts, CLI evaluators) and introduces structured query models with temporal predicates.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant ChatRAG as ChatRAG Pipeline
    participant LLM as Query Generator
    participant Retriever as RetrieverPipeline
    participant Milvus as Milvus DB
    participant Judge as LLM Judge
    User->>ChatRAG: sends last message
    ChatRAG->>LLM: format prompt (current_date, query_language)
    LLM-->>ChatRAG: return SearchQueries (query_list of Query objects)
    ChatRAG->>Retriever: retrieve_docs(Query)
    Retriever->>Retriever: Query.to_milvus_filter() -> filter string
    Retriever->>Milvus: vector search with filter
    Milvus-->>Retriever: results
    alt filtered results empty & allow_filterless_fallback
        Retriever->>Milvus: retry vector search without filter
        Milvus-->>Retriever: results
    end
    Retriever-->>ChatRAG: documents
    ChatRAG->>User: final response with sources
    Note over Judge: (async) Judge evaluates semantic coverage or filter correctness
    ChatRAG->>Judge: send case + generated output
    Judge-->>ChatRAG: verdict (covered/correct + reasoning)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Force-pushed: 22fc7d7 → e043548
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
prompts/example1/spoken_style_answer_tmpl.txt (1)
22-22: ⚠️ Potential issue | 🟡 Minor

Fix typo in user-facing prompt text.

"humain" should be "human".

📝 Proposed fix

```diff
- * Keep your answer succinct and straight to the point as if speaking to humain.
+ * Keep your answer succinct and straight to the point as if speaking to human.
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@prompts/example1/spoken_style_answer_tmpl.txt` at line 22, Fix the typo in the user-facing prompt template: in the prompt string that begins "Keep your answer succinct and straight to the point as if speaking to humain." (in spoken_style_answer_tmpl.txt / the "Keep your answer..." prompt), change "humain" to "human" so the sentence reads "...speaking to human."

openrag/components/pipeline.py (1)
294-301: ⚠️ Potential issue | 🔴 Critical

Fix web search calls to pass query string instead of Query object.

At lines 297 and 320, `web_search_service.search()` receives a `Query` Pydantic model instead of the expected `str`. This silently calls `Query.__str__()`, which returns `f"Query: {self.query}, Filter: {self.to_milvus_filter()}"`, polluting the search engine with literal tokens ("Query:", "Filter:") and Milvus filter expressions, significantly degrading result quality.

Fix

```diff
- web_tasks = [self.web_search_service.search(q) for q in queries.query_list]
+ web_tasks = [self.web_search_service.search(q.query) for q in queries.query_list]
```

Apply the same fix at line 320 (web-only branch).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openrag/components/pipeline.py` around lines 294 - 301, The web search is being called with a Query Pydantic model instead of the raw string, which causes Query.__str__() to pollute the search; update the web search invocations to pass the actual query string (use q.query) rather than the Query object: change the web_tasks comprehension that builds from queries.query_list to call self.web_search_service.search(q.query), and make the identical change in the web-only branch where web_tasks is constructed (the second occurrence near the other web-only logic).
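The failure mode is easy to reproduce in isolation. A minimal sketch, assuming the `Query` shape and `__str__` format quoted in the review (these stand-ins are not the project's actual classes):

```python
from dataclasses import dataclass

@dataclass
class Query:
    """Stand-in for the pipeline's pydantic Query model (assumed shape)."""
    query: str

    def to_milvus_filter(self) -> str:
        return 'created_at >= "2026-03-15T00:00:00+00:00"'

    def __str__(self) -> str:
        # Matches the __str__ format quoted in the review comment.
        return f"Query: {self.query}, Filter: {self.to_milvus_filter()}"

def web_search(term) -> str:
    # A search client that coerces its argument to str, as many HTTP
    # client call paths effectively do.
    return f"searching for: {term}"

q = Query(query="latest Milvus release notes")
polluted = web_search(q)      # leaks "Query: ..., Filter: ..." tokens
clean = web_search(q.query)   # sends only the user-facing query text
```

Passing `q.query` keeps the search term free of Milvus filter syntax, which is exactly the one-token change the suggested diff makes.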
♻️ Duplicate comments (1)
utility/data_indexer2.py (1)
1-121: 🛠️ Refactor suggestion | 🟠 Major

This script appears to duplicate `data_indexer.py` and was reportedly not intended for upstream.

Per prior discussion on this PR, you mentioned this file was only for local use and not meant to be pushed. It also largely duplicates `utility/data_indexer.py` with added PDF-only filtering, a 100-file cap, page-count bounds (11–40), and size/page statistics. If it is intentionally kept, please:

- Rename it to reflect its purpose (e.g., `pdf_benchmark_indexer.py`) rather than `data_indexer2`.
- Fold the shared logic (arg parsing, `__check_api`, upload loop) into a common helper to avoid drift between the two scripts.
- Document the rationale for the hardcoded `limit = 100` and page bounds `(10, 40]` — these look benchmark-specific.

Otherwise, consider dropping it from this PR since it is orthogonal to the temporal-filters feature.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@utility/data_indexer2.py` around lines 1 - 121, This file duplicates existing data_indexer.py and shouldn't be upstream as-is; either remove it from the PR or rename it to reflect purpose (e.g., pdf_benchmark_indexer.py) and refactor common logic (arg parsing, __check_api, upload loop) into a shared helper module to prevent drift; if you keep it, add a short docstring near the top declaring its benchmark intent and justify the hard-coded limit = 100 and the page bounds check (the page_count <= 10 or page_count > 40 filter) so reviewers know they are intentional.
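If the file is kept, the shared-helper refactor can stay small. A hedged sketch of what such a common module might hold (module, function, and parameter names are hypothetical; `get` is injected in place of `httpx.get` so the helper is testable offline):

```python
from typing import Any, Callable

def check_api(base_url: str, get: Callable[[str], Any]) -> None:
    """Fail fast if the indexer health endpoint is unreachable or non-2xx.

    `get` is an httpx.get-like callable returning a response object with
    a raise_for_status() method; pass httpx.get in the real scripts.
    """
    response = get(f"{base_url}/health")
    # raise_for_status() turns any 4xx/5xx into an exception instead of
    # letting the script continue indexing against a broken endpoint.
    response.raise_for_status()

def iter_pdfs(paths: list[str], limit: int = 100):
    """Benchmark-specific PDF filtering kept in one place (limit assumed)."""
    count = 0
    for path in paths:
        if not path.lower().endswith(".pdf"):
            continue
        if count >= limit:
            break
        count += 1
        yield path
```

Both indexer scripts could then import these helpers, so a fix like the non-200 handling below lands in one place.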
🧹 Nitpick comments (5)
utility/data_indexer.py (1)
36-46: `__check_api` does not fail on non-200 responses.

If the health check returns e.g. 500 or 404, the function silently returns and the script proceeds to index. Only `httpx.RequestError` (connection-level) is raised. Consider `response.raise_for_status()` or an explicit non-200 branch so a misconfigured URL/auth fails fast.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@utility/data_indexer.py` around lines 36 - 46, The __check_api function currently only raises on httpx.RequestError and silently continues on non-200 responses; modify __check_api to detect non-200 HTTP responses (e.g., call response.raise_for_status() or explicitly check response.status_code != 200) and raise an exception (or log and raise) so the caller fails fast; update the try/except around httpx.get in __check_api to include HTTP status errors (using response.raise_for_status() is preferred) so any 4xx/5xx result causes the function to raise instead of returning silently.

benchmarks/prompt_eval/Report_query_decomposition.md (1)
50-63: Optional: Consider varying sentence structure for readability.

The static analysis tool flags repetitive sentence openings in the "Common hard cases" section. While not critical, varying the structure could improve readability.

✍️ Optional rewording suggestion

```diff
 ### Common hard cases (both models, both prompts)
-- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
-- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
-- **id 70** — "air freight vs sea freight": emitted as a single comparison instead of two independent lookups.
-- **id 74** — "EASA vs FAA certification": same pattern.
+- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
+- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
+- For **id 70** ("air freight vs sea freight"), the model emits a single comparison instead of two independent lookups.
+- **id 74** ("EASA vs FAA certification") exhibits the same pattern.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/prompt_eval/Report_query_decomposition.md` around lines 50 - 63, The "Common hard cases (both models, both prompts)" bullet list uses repetitive sentence openings and punctuation; please rewrite that section (the header and the four bullets referencing id 30, id 63, id 70, id 74) to vary sentence structure and lead-ins for readability — e.g., turn some bullets into full sentences starting with the case description ("Compare AWS Lambda and Google Cloud Functions") and others starting with the observed failure ("Earlier gRPC turn..."), replace repeated em-dash openings with varied connectors or clauses, and ensure each item still mentions the id and the failure mode (single comparison, contaminated reformulation, under-splitting) without changing meaning.

benchmarks/prompt_eval/eval_temporal_filter_generation.py (1)
55-230: Extract shared infra with `eval_query_decomposition.py` to prevent drift.

`_parse_env_list`, `_build_models`, `_judge_config`, `_model_kwargs`, `build_llm_messages`, `format_prompt`, the `TemporalPredicate`/`Query`/`SearchQueries` schemas, and the CLI path-sandboxing logic are duplicated almost verbatim between the two evaluators. Any future change (e.g. adding `strict=True`, switching to UTC, changing the default `max_completion_tokens`) will need to be applied in both places, and already-present asymmetries will grow. Consider extracting a `benchmarks/prompt_eval/_common.py` (or `lib/`) module.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@benchmarks/prompt_eval/eval_temporal_filter_generation.py` around lines 55 - 230, Extract duplicated utilities and schemas into a shared module (e.g., benchmarks/prompt_eval/_common.py): move _parse_env_list, _build_models, _judge_config, _model_kwargs, build_llm_messages, format_prompt and the Pydantic models TemporalPredicate, Query, SearchQueries (and any CLI path-sandboxing helpers) into that module; then import and use them from eval_temporal_filter_generation.py (and eval_query_decomposition.py) so both files reference the single implementation; ensure imports/exports preserve names and behavior (including default values like max_completion_tokens and any langdetect fallback), run lints/tests, and update any relative references to MODELS or judge_config usage to use the shared module.

benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt (1)
31-42: Tighten the `past N` / `bare MONTH` rules — the formulas drop the time unit.

Line 37 writes `[Current − N, tomorrow)` without carrying over the unit (days/weeks/months/years), so a model parsing "past 2 weeks" could read the lower bound as `Current − 2`. Line 41 ("nearest past occurrence") is ambiguous when the bare month equals the current month or is in the future (e.g. "December" queried in April 2026 — Dec 2025 or Dec 2026?). Consider:

✏️ Proposed clarification

```diff
-- past N days/weeks/months/years → `[Current − N, tomorrow)`
+- past N days/weeks/months/years → `[Current − N <same unit>, tomorrow)` (e.g. "past 2 weeks" → `[Current − 14 days, tomorrow)`)
 - recent / latest → treat as "past 90 days"
 - since X → one predicate `>= X`
 - before X → one predicate `< X`
-- bare MONTH (no year) → nearest past occurrence
+- bare MONTH (no year) → the most recent fully-elapsed occurrence of that month (if the named month equals the current month, use the occurrence one year prior)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt` around lines 31 - 42, The "past N" and "bare MONTH" rules are ambiguous: change the "past N days/weeks/months/years → `[Current − N, tomorrow)`" line to explicitly carry the time unit (e.g., "past N days/weeks/months/years → `[Current − N {days|weeks|months|years}, tomorrow)`") and ensure "weeks" and "months" are interpreted as whole-week and whole-month arithmetic (not unitless subtraction). Clarify the "bare MONTH (no year) → nearest past occurrence" rule by specifying deterministic behavior: if MONTH is before the current month use MONTH of the current year; if MONTH is the current month or after, use MONTH of the previous year (i.e., the most recent calendar occurrence strictly before today). Ensure these exact phrases (the "past N days/weeks/months/years" rule and the "bare MONTH (no year)" rule) are updated so models include the unit and the tie-breaker rule unambiguously.

openrag/components/pipeline.py (1)
73-77: The double-quote syntax is correct; single and double quotes are both valid in Milvus TIMESTAMPTZ filter expressions.

The codebase's use of double quotes in ISO `"2026-03-15T00:00:00+00:00"` is valid. Milvus's expression parser accepts both single and double quotes for string literals, including ISO timestamp literals, and treats them equivalently. This is consistent across the codebase (tests, docstrings, and the `to_milvus_filter()` method).

However, the broader design note stands: `TemporalPredicate.operator` permits `["==", "!=", ">", "<", ">=", "<="]`, but all tested and documented filter cases use only ordered comparisons (`<`, `>`, `>=`, `<=`). Equality and inequality checks on TIMESTAMPTZ fields are semantically unusual. If equality comparisons are not intentional, consider narrowing the `Literal` to exclude `"=="` and `"!="`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openrag/components/pipeline.py` around lines 73 - 77, The TemporalPredicate currently allows equality operators but we only use ordered comparisons for TIMESTAMPTZ; update the type/definition for the TemporalPredicate.operator (the Literal used where TemporalPredicate is declared) to remove "==" and "!=" so only "<", ">", "<=", ">=" are permitted; keep the to_milvus_filter() behavior (it can continue to emit ISO "..." strings) and ensure any type hints or validations referencing TemporalPredicate.operator (e.g., construction sites or validators) are updated to the narrower set.
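The narrowing is mechanical. A stdlib sketch of the tightened model (the project reportedly uses Pydantic; a frozen dataclass with the same `Literal` shows the shape, and `to_milvus_filter` mirrors the AND-combined, double-quoted output described above):

```python
from dataclasses import dataclass
from typing import Literal, get_args

ComparisonOp = Literal["<", ">", "<=", ">="]  # "==" and "!=" removed

@dataclass(frozen=True)
class TemporalPredicate:
    field: str
    operator: ComparisonOp
    value: str  # ISO 8601 UTC timestamp

    def __post_init__(self) -> None:
        # Runtime guard; Pydantic would enforce the Literal automatically.
        if self.operator not in get_args(ComparisonOp):
            raise ValueError(f"unsupported operator: {self.operator!r}")

def to_milvus_filter(predicates: list[TemporalPredicate]) -> str:
    # Predicates are AND-combined; timestamps emitted as quoted ISO strings.
    return " and ".join(f'{p.field} {p.operator} "{p.value}"' for p in predicates)
```

With the narrowed `Literal`, a stray `"=="` from the LLM is rejected at model-validation time instead of reaching the vector store.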
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/prompt_eval/eval_query_decomposition.py`:
- Around line 1-32: The default dataset filename is misspelled as
"query_decompostion" in multiple places; update every occurrence to
"query_decomposition" — specifically fix the docstring references and the
argparse default for the --dataset option (the string
"datasets/query_decompostion.json" and any plain "query_decompostion"
occurrences) so that the script loads datasets/query_decomposition.json by
default and no longer raises FileNotFoundError when run without --dataset.
In `@openrag/components/pipeline.py`:
- Around line 112-122: The current silent fallback that re-runs
self.retriever.retrieve without filters when docs is empty should be made
explicit and configurable: add a pipeline config flag (e.g.,
allow_filterless_fallback) and only perform the unfiltered re-query when that
flag is True; when you do fallback, set a metadata flag (e.g.,
filter_dropped=True) on the returned result (or attach to the docs/container) so
callers can surface that the temporal filter was dropped; replace the
logger.debug call with logger.warning or logger.info so it’s visible at INFO
level and include context (query=query.query, partition, reason); keep the
original retriever call parameters (partition, query.query, filter_params) when
not falling back and ensure all code paths return the same result shape
including the new metadata key.
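The requested behavior can be sketched end to end. The names (`allow_filterless_fallback`, `filter_dropped`) follow the review's wording, and the retriever interface here is a stand-in, not the pipeline's real API:

```python
import asyncio
import logging

logger = logging.getLogger("retriever")

async def retrieve_docs(retriever, partition, query, filter_expr,
                        allow_filterless_fallback: bool):
    """Return (docs, meta); meta records whether the filter was dropped."""
    docs = await retriever.retrieve(partition, query, filter_expr)
    meta = {"filter_dropped": False}
    if not docs and filter_expr and allow_filterless_fallback:
        # Warning level so the dropped filter is visible at default log levels,
        # with enough context to debug the query.
        logger.warning(
            "No results with temporal filter; retrying without it "
            "(query=%r, partition=%r)", query, partition,
        )
        docs = await retriever.retrieve(partition, query, None)
        meta["filter_dropped"] = True
    return docs, meta
```

Callers can then surface "your date filter was ignored" to the user whenever `meta["filter_dropped"]` is set, instead of silently returning out-of-range documents.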
In `@utility/data_indexer.py`:
- Around line 57-60: The deterministic file_id generation (absolute_path ->
file_id) now causes duplicate POSTs to url =
f"{args.url}/indexer/partition/{args.partition}/file/{file_id}" to return HTTP
409 and break reruns; update the uploader call in utility/data_indexer.py (the
block that posts to url) to treat 409 as non-fatal: catch the HTTP response or
exception from the POST, and if response.status_code == 409 (or the raised
HTTPError maps to 409), log an informative message and continue (skip that file)
instead of raising; otherwise preserve existing error handling for other status
codes so real failures still surface.
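A sketch of the 409-tolerant upload step (`post` is injected in place of `httpx.post` so the shape is testable without a live API; the response is assumed to expose `.status_code` and `.raise_for_status()` as httpx responses do):

```python
import logging

logger = logging.getLogger("indexer")

def upload_file(post, url: str, file_id: str) -> bool:
    """Return True if indexed, False if skipped as an existing duplicate."""
    response = post(url)
    if response.status_code == 409:
        # Deterministic file_ids make reruns hit 409; treat it as non-fatal.
        logger.info("File %s already indexed, skipping", file_id)
        return False
    response.raise_for_status()  # other 4xx/5xx still fail loudly
    return True
```

This keeps reruns idempotent while preserving the fail-fast behavior for genuine errors.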
In `@utility/data_indexer2.py`:
- Around line 63-68: The try/except around fitz.open(file_path) currently
swallows all exceptions; change it to except Exception as e and log the failure
before falling back to page_count = 0 (e.g. logger.warning or logger.exception)
so operators can see why a PDF was skipped; update the block that calls
fitz.open(file_path) and sets page_count to capture the exception object (e) and
include file_path and e in the log message.
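A sketch of the logged fallback around the PDF open (`opener` stands in for `fitz.open` so the example runs without PyMuPDF installed):

```python
import logging

logger = logging.getLogger("indexer")

def safe_page_count(file_path: str, opener) -> int:
    """Count pages, logging (not swallowing) the reason a PDF is skipped."""
    try:
        doc = opener(file_path)  # fitz.open(file_path) in the real script
        return len(doc)
    except Exception as e:       # narrowed from a bare `except:`
        logger.warning("Skipping %s: could not read PDF (%s)", file_path, e)
        return 0
```

Operators still get the `page_count = 0` fallback, but the warning now says which file failed and why.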
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 395f8e6d-4bcc-412e-9833-038603131849
⛔ Files ignored due to path filters (1)

- `benchmarks/prompt_eval/uv.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (20)

- benchmarks/prompt_eval/.env.example
- benchmarks/prompt_eval/Report_query_decomposition.md
- benchmarks/prompt_eval/Report_temporal_filter.md
- benchmarks/prompt_eval/datasets/query_decomposition.json
- benchmarks/prompt_eval/datasets/temporal_filter.json
- benchmarks/prompt_eval/eval_query_decomposition.py
- benchmarks/prompt_eval/eval_temporal_filter_generation.py
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
- benchmarks/prompt_eval/pyproject.toml
- benchmarks/prompt_eval/results/result_filter_generation.json
- benchmarks/prompt_eval/results/result_query_decomposition.json
- openrag/components/pipeline.py
- prompts/example1/query_contextualizer_tmpl.txt
- prompts/example1/spoken_style_answer_tmpl.txt
- prompts/example1/sys_prompt_tmpl.txt
- pyproject.toml
- tests/api_tests/api_run/mock_vllm.py
- utility/data_indexer.py
- utility/data_indexer2.py
Force-pushed: 7d30ac3 → 40ed25d

Force-pushed: 40ed25d → 240f1c1
Actionable comments posted: 4
🧹 Nitpick comments (1)
benchmarks/prompt_eval/eval_query_decomposition.py (1)
335-355: Bound concurrent LLM calls to avoid rate-limit-skewed results.

`asyncio.as_completed(tasks)` schedules every case immediately. With 80 cases, each potentially doing a generator call and a judge call, transient rate limits can dominate the report instead of prompt quality. Consider wrapping `run_case()` with an `asyncio.Semaphore` and exposing a `--concurrency` option.

Minimal bounded-concurrency shape

```diff
+    semaphore = asyncio.Semaphore(8)
+
+    async def run_limited_case(case: dict) -> CaseResult:
+        async with semaphore:
+            return await run_case(
+                case,
+                prompt_template,
+                query_generator,
+                model_base_url,
+                coverage_judge,
+                judge_base_url,
+            )
+
     tasks = [
-        run_case(
-            case,
-            prompt_template,
-            query_generator,
-            model_base_url,
-            coverage_judge,
-            judge_base_url,
-        )
+        run_limited_case(case)
         for case in dataset
     ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/prompt_eval/eval_query_decomposition.py` around lines 335 - 355, The loop currently schedules every run_case concurrently via asyncio.as_completed which can flood LLM endpoints; introduce a concurrency limiter by creating an asyncio.Semaphore (configurable via a new --concurrency CLI option) and wrap each run_case invocation in an async wrapper that acquires the semaphore before calling run_case(case, ...) and releases it after, then schedule those wrapper coroutines instead of raw run_case in tasks; keep the tqdm/as_completed consumption the same but ensure tasks are created from the semaphore-wrapped coroutine to bound simultaneous generator/judge calls.
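The suggested pattern, reduced to a runnable sketch (`run_case` and `cases` are stand-ins for the evaluator's real coroutine and dataset; the default of 8 is an assumption to be exposed via `--concurrency`):

```python
import asyncio

async def run_all(cases, run_case, concurrency: int = 8):
    """Run run_case over every case with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_limited(case):
        async with semaphore:  # at most `concurrency` simultaneous calls
            return await run_case(case)

    # gather preserves input order; as_completed also works for tqdm progress.
    return await asyncio.gather(*(run_limited(c) for c in cases))
```

Because the semaphore is acquired before the generator/judge calls, transient rate limits throttle throughput instead of producing a burst of failures that skews the report.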
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/prompt_eval/.env.example`:
- Around line 2-4: The env var names BASE_URLs, API_KEYs, and MODELs in
.env.example are mixed-case and trigger dotenv-linter; either normalize them to
uppercase across all usages or add a linter exception: Option A — rename the
three variables to BASE_URLS, API_KEYS, and MODELS and update every reference in
eval_query_decomposition.py (references to BASE_URLs/API_KEYs/MODELs) and
eval_temporal_filter_generation.py (same three references) so they match; Option
B — add a .dotenv-linter.json rule to ignore these specific keys (BASE_URLs,
API_KEYs, MODELs) to suppress the warning if you want to keep the current names.
In `@benchmarks/prompt_eval/eval_query_decomposition.py`:
- Around line 192-204: format_prompt currently uses datetime.now(), making
outputs non-deterministic; change format_prompt to accept a reference_date (or
read a module-level DATASET_REFERENCE_DATE set to 2026-04-17) and use that date
to produce current_date (e.g., reference_date.strftime(...)) instead of
datetime.now(); update the other place that formats current_date (the similar
block at lines ~255-258) to also use the same reference_date or constant so all
relative-date prompt injections are deterministic and reproducible.
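A sketch of the deterministic date injection (the 2026-04-17 reference date comes from the review; `format_prompt` here is a simplified stand-in for the evaluator's real helper):

```python
from datetime import date

DATASET_REFERENCE_DATE = date(2026, 4, 17)

def format_prompt(template: str,
                  reference_date: date = DATASET_REFERENCE_DATE) -> str:
    """Render current_date from a pinned reference, not datetime.now().

    Every run then resolves relative dates ("past month", "last week")
    against the same anchor, so judge verdicts are reproducible.
    """
    return template.format(current_date=reference_date.isoformat())
```

The same constant should feed every place the script formats `current_date`, so the generator and the judge see an identical anchor.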
In `@openrag/config/models.py`:
- Line 422: Change the default for temporal retrieval to strict by setting the
Allow Filterless Fallback flag to False: update the attribute
allow_filterless_fallback in the ModelsConfig (symbol:
allow_filterless_fallback) from True to False and mirror the same default in
conf/config.yaml so both runtime model config and the YAML default are
consistent; ensure any documentation or initialization logic that reads
allow_filterless_fallback continues to work with the new False default.
In `@prompts/example1/query_contextualizer_tmpl.txt`:
- Around line 26-29: The example that maps content dates to a created_at filter
is incorrect: only emit temporal_filters (with field="created_at") when the user
explicitly refers to document creation/publication/written times; for user
requests about events/meetings held in a time window leave temporal_filters null
(or emit a separate event-time predicate if your schema supports it). Update the
example that rewrites “meetings held in the past month” so it does not produce
created_at predicates, and apply the same correction to the other similar
example that currently emits created_at; keep the half-open interval rule and
ISO8601 format for true created_at predicates.
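The half-open interval rule referenced above can be made concrete. A sketch that reproduces the example's `[2026-03-15, 2026-04-16)` window (treating "past month" as 31 days is an assumption; the prompt itself does not fix the length):

```python
from datetime import date, timedelta

def past_month_window(current: date) -> tuple[str, str]:
    """Half-open [start, next_start) bounds for a 'past month' request."""
    start = current - timedelta(days=31)       # assumed month length
    next_start = current + timedelta(days=1)   # tomorrow: exclusive upper bound
    iso_suffix = "T00:00:00+00:00"
    return (start.isoformat() + iso_suffix, next_start.isoformat() + iso_suffix)
```

The `>=` predicate takes the first bound and `<` the second, which avoids the off-by-one-day ambiguity of closed intervals at day granularity.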
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7464efc4-a5bd-4fad-871a-dd7b5cdb4605
⛔ Files ignored due to path filters (1)

- `benchmarks/prompt_eval/uv.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (21)

- benchmarks/prompt_eval/.env.example
- benchmarks/prompt_eval/Report_query_decomposition.md
- benchmarks/prompt_eval/Report_temporal_filter.md
- benchmarks/prompt_eval/datasets/query_decomposition.json
- benchmarks/prompt_eval/datasets/temporal_filter.json
- benchmarks/prompt_eval/eval_query_decomposition.py
- benchmarks/prompt_eval/eval_temporal_filter_generation.py
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
- benchmarks/prompt_eval/pyproject.toml
- benchmarks/prompt_eval/results/result_filter_generation.json
- benchmarks/prompt_eval/results/result_query_decomposition.json
- conf/config.yaml
- openrag/components/pipeline.py
- openrag/config/loader.py
- openrag/config/models.py
- prompts/example1/query_contextualizer_tmpl.txt
- prompts/example1/spoken_style_answer_tmpl.txt
- prompts/example1/sys_prompt_tmpl.txt
- pyproject.toml
- tests/api_tests/api_run/mock_vllm.py
✅ Files skipped from review due to trivial changes (9)
- prompts/example1/spoken_style_answer_tmpl.txt
- prompts/example1/sys_prompt_tmpl.txt
- pyproject.toml
- benchmarks/prompt_eval/pyproject.toml
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
- benchmarks/prompt_eval/datasets/temporal_filter.json
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
- benchmarks/prompt_eval/Report_temporal_filter.md
- benchmarks/prompt_eval/eval_temporal_filter_generation.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/api_tests/api_run/mock_vllm.py
- benchmarks/prompt_eval/datasets/query_decomposition.json
- openrag/components/pipeline.py
> # Temporal filters
> Emit `temporal_filters` when the user refers to WHEN a document was created / published / written. Leave null if the date only describes the topic.
> Each predicate: `field="created_at"`, `operator` ∈ {{`>=`, `>`, `<=`, `<`}}, `value` = ISO 8601 UTC.
> Predicates are AND-combined. Always use half-open intervals: `>= start` AND `< next_start`.
Fix the example that converts a content date into a created_at filter.

Line 27 says filters are only for document creation/publication/writing time, but the example rewrites the request as meetings held in the past month and still emits `created_at` predicates. That will bias the model to over-filter content-date queries.

Prompt-safe fix

```diff
-User: "Summary of meetings in the past month"
-{{"query_list":[{{"query":"Summary of meetings held in the past month","temporal_filters":[{{"field":"created_at","operator":">=","value":"2026-03-15T00:00:00+00:00"}},{{"field":"created_at","operator":"<","value":"2026-04-16T00:00:00+00:00"}}]}}]}}
+User: "Summary of meeting notes uploaded in the past month"
+{{"query_list":[{{"query":"Summary of meeting notes uploaded in the past month","temporal_filters":[{{"field":"created_at","operator":">=","value":"2026-03-15T00:00:00+00:00"}},{{"field":"created_at","operator":"<","value":"2026-04-16T00:00:00+00:00"}}]}}]}}
```

Also applies to: 48-49
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@prompts/example1/query_contextualizer_tmpl.txt` around lines 26 - 29, The
example that maps content dates to a created_at filter is incorrect: only emit
temporal_filters (with field="created_at") when the user explicitly refers to
document creation/publication/written times; for user requests about
events/meetings held in a time window leave temporal_filters null (or emit a
separate event-time predicate if your schema supports it). Update the example
that rewrites “meetings held in the past month” so it does not produce
created_at predicates, and apply the same correction to the other similar
example that currently emits created_at; keep the half-open interval rule and
ISO8601 format for true created_at predicates.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openrag/components/pipeline.py (1)
303-342: ⚠️ Potential issue | 🔴 Critical

Pass `q.query` (string) to web search instead of the Query object.

At lines 312 and 335, `queries.query_list` contains `Query` pydantic objects, but `self.web_search_service.search()` expects a string parameter. Passing the Query object will stringify it to `"Query: ..., Filter: ..."` via `__str__()`, which is not a valid search term and will break web search.

🔧 Suggested fix

- web_tasks = [self.web_search_service.search(q) for q in queries.query_list]
+ web_tasks = [self.web_search_service.search(q.query) for q in queries.query_list]

- raw_web_lists = await asyncio.gather(*[self.web_search_service.search(q) for q in queries.query_list])
+ raw_web_lists = await asyncio.gather(
+     *[self.web_search_service.search(q.query) for q in queries.query_list]
+ )
- web_tasks = [self.web_search_service.search(q) for q in queries.query_list] + web_tasks = [self.web_search_service.search(q.query) for q in queries.query_list]- raw_web_lists = await asyncio.gather(*[self.web_search_service.search(q) for q in queries.query_list]) + raw_web_lists = await asyncio.gather( + *[self.web_search_service.search(q.query) for q in queries.query_list] + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openrag/components/pipeline.py` around lines 303 - 342, The web search is being called with Query objects from queries.query_list causing invalid search terms; update both places where self.web_search_service.search is invoked (the web_tasks comprehension and the inline search in web-only mode) to pass the raw string field (q.query) instead of the Query object q so search receives a plain string; ensure any variable names (e.g., web_tasks, raw_web_lists) remain compatible with this change.
🧹 Nitpick comments (1)
openrag/components/pipeline.py (1)
115-128: Filterless fallback now gated + warned — addresses prior feedback.

Config flag (`allow_filterless_fallback`) plus `logger.warning` with query/filter/partition context correctly resolves the earlier concern about silent temporal-filter dropping. Consider also surfacing a `filter_dropped` signal on the response metadata so the UI can warn the user that results no longer respect the requested date range; not a blocker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openrag/components/pipeline.py` around lines 115 - 128, When the code falls back to filterless retrieval (the if block checking not docs and milvus_filter and self.allow_filterless_fallback), add a machine-readable flag (e.g., filter_dropped: True) into the response metadata so callers/UIs can detect that results no longer respect the original temporal filter; set this flag on whatever response container you return from this pipeline (or on each Document/DocsMetadata) immediately after docs = await self.retriever.retrieve(...), using the same context (query.query, partition, milvus_filter) so downstream code can surface a warning. Ensure the flag is only present when this fallback path executes and preserved in the pipeline’s output metadata structure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openrag/components/pipeline.py`:
- Around line 74-78: The Milvus filter builder in to_milvus_filter currently
emits expressions like ISO "..." which Milvus boolean grammar doesn't accept;
update the format in the list comprehension inside to_milvus_filter (where parts
is built from self.temporal_filters) to drop the "ISO " prefix and emit just the
quoted timestamp (e.g., f'{p.field} {p.operator} "{p.value}"'), then join parts
as before and return the resulting string.
---
Outside diff comments:
In `@openrag/components/pipeline.py`:
- Around line 303-342: The web search is being called with Query objects from
queries.query_list causing invalid search terms; update both places where
self.web_search_service.search is invoked (the web_tasks comprehension and the
inline search in web-only mode) to pass the raw string field (q.query) instead
of the Query object q so search receives a plain string; ensure any variable
names (e.g., web_tasks, raw_web_lists) remain compatible with this change.
---
Nitpick comments:
In `@openrag/components/pipeline.py`:
- Around line 115-128: When the code falls back to filterless retrieval (the if
block checking not docs and milvus_filter and self.allow_filterless_fallback),
add a machine-readable flag (e.g., filter_dropped: True) into the response
metadata so callers/UIs can detect that results no longer respect the original
temporal filter; set this flag on whatever response container you return from
this pipeline (or on each Document/DocsMetadata) immediately after docs = await
self.retriever.retrieve(...), using the same context (query.query, partition,
milvus_filter) so downstream code can surface a warning. Ensure the flag is only
present when this fallback path executes and preserved in the pipeline’s output
metadata structure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8e1f2b90-06ba-47ce-82cf-e919741b981a
📒 Files selected for processing (1)
openrag/components/pipeline.py
4f47041 to
25015db
Compare
25015db to
0485e59
Compare
Actionable comments posted: 2
🧹 Nitpick comments (1)
openrag/components/pipeline.py (1)
211-273: Add `ValidationError` and `OutputParserException` to `exceptions_to_handle` in the fallback chain.

The current setup only triggers the `function_calling` fallback on `openai.BadRequestError`. However, when `json_schema` output fails Pydantic validation (raising `ValidationError`) or parsing (raising `OutputParserException`), the outer retry loop catches these exceptions but attempts the same `json_schema` path again on the second try, never exercising the fallback. Adding these exceptions to `exceptions_to_handle` ensures the fallback actually gets a chance to succeed on schema validation failures:

Change

  self.query_generator = primary.with_fallbacks(
      [fallback],
-     exceptions_to_handle=(openai.BadRequestError,),
+     exceptions_to_handle=(openai.BadRequestError, ValidationError, OutputParserException),
  )

🤖 Prompt for AI Agents
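The fallback semantics this comment relies on can be sketched in plain Python. This is a simplified stand-in, not LangChain's actual `Runnable.with_fallbacks` machinery; the function and exception names below are illustrative:

```python
def with_fallbacks(primary, fallbacks, exceptions_to_handle=(Exception,)):
    """Run primary; on a listed exception type, try each fallback in order."""
    def run(x):
        try:
            return primary(x)
        except exceptions_to_handle:
            for fb in fallbacks:
                try:
                    return fb(x)
                except exceptions_to_handle:
                    continue
            raise  # all paths failed: re-raise the original error
    return run

class ValidationError(Exception):
    pass

def json_schema_path(x):
    # Simulate strict json_schema output failing Pydantic validation.
    raise ValidationError("schema mismatch")

def function_calling_path(x):
    return {"query_list": [x]}

gen = with_fallbacks(json_schema_path, [function_calling_path], (ValidationError,))
print(gen("reports from last month"))  # {'query_list': ['reports from last month']}
```

The point of the review comment is that if `ValidationError` is not in `exceptions_to_handle`, the fallback path above is simply never reached.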
Verify each finding against the current code and only fix it if needed. In `@openrag/components/pipeline.py` around lines 211 - 273, The fallback chain currently only handles openai.BadRequestError so schema/parse failures never trigger the function_calling fallback; update the with_fallbacks call that builds self.query_generator (primary, fallback, exceptions_to_handle) to also include ValidationError and OutputParserException alongside openai.BadRequestError so that schema validation/parsing errors will invoke the fallback path (references: primary, fallback, self.query_generator, with_fallbacks, exceptions_to_handle, ValidationError, OutputParserException, openai.BadRequestError).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/prompt_eval/eval_temporal_filter_generation.py`:
- Around line 128-139: The benchmark's TemporalPredicate model allows "==" and
"!=" but production's TemporalPredicate (openrag/components/pipeline.py) only
permits ">", "<", ">=", "<="; update the operator Literal in TemporalPredicate
in this file (class TemporalPredicate, field operator) to match production by
removing "==" and "!=" so it becomes Literal[">", "<", ">=", "<="], and make the
identical change to the copy in eval_query_decomposition.py so both benchmarks
mirror production validation and scoring.
In `@tests/api_tests/api_run/mock_vllm.py`:
- Around line 238-264: Handle enum/const and prevent infinite recursion: update
_mock_value to check for "const" and "enum" (return schema["const"] if present;
for "enum" return the first enum value that matches the expected type) before
the type branches so constrained strings (e.g., TemporalPredicate.op) are mocked
correctly; in _resolve_ref replace lstrip("#/") with removeprefix("#/") to
correctly strip JSON Pointer prefixes; add a simple cycle guard by threading a
visited set (e.g., visited_ids) through _resolve_ref and _mock_value to detect
and break recursive $ref loops (return None or a safe default when a cycle is
seen). Ensure you reference the existing function names _mock_value and
_resolve_ref when making changes.
---
Nitpick comments:
In `@openrag/components/pipeline.py`:
- Around line 211-273: The fallback chain currently only handles
openai.BadRequestError so schema/parse failures never trigger the
function_calling fallback; update the with_fallbacks call that builds
self.query_generator (primary, fallback, exceptions_to_handle) to also include
ValidationError and OutputParserException alongside openai.BadRequestError so
that schema validation/parsing errors will invoke the fallback path (references:
primary, fallback, self.query_generator, with_fallbacks, exceptions_to_handle,
ValidationError, OutputParserException, openai.BadRequestError).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ee251a91-f7f7-4536-8d70-5508ff06add8
⛔ Files ignored due to path filters (1)
`benchmarks/prompt_eval/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (21)
- benchmarks/prompt_eval/.env.example
- benchmarks/prompt_eval/Report_query_decomposition.md
- benchmarks/prompt_eval/Report_temporal_filter.md
- benchmarks/prompt_eval/datasets/query_decomposition.json
- benchmarks/prompt_eval/datasets/temporal_filter.json
- benchmarks/prompt_eval/eval_query_decomposition.py
- benchmarks/prompt_eval/eval_temporal_filter_generation.py
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
- benchmarks/prompt_eval/pyproject.toml
- benchmarks/prompt_eval/results/result_filter_generation.json
- benchmarks/prompt_eval/results/result_query_decomposition.json
- conf/config.yaml
- openrag/components/pipeline.py
- openrag/config/loader.py
- openrag/config/models.py
- prompts/example1/query_contextualizer_tmpl.txt
- prompts/example1/spoken_style_answer_tmpl.txt
- prompts/example1/sys_prompt_tmpl.txt
- pyproject.toml
- tests/api_tests/api_run/mock_vllm.py
✅ Files skipped from review due to trivial changes (7)
- prompts/example1/sys_prompt_tmpl.txt
- prompts/example1/spoken_style_answer_tmpl.txt
- pyproject.toml
- benchmarks/prompt_eval/pyproject.toml
- benchmarks/prompt_eval/datasets/temporal_filter.json
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
🚧 Files skipped from review as they are similar to previous changes (4)
- openrag/config/models.py
- conf/config.yaml
- benchmarks/prompt_eval/datasets/query_decomposition.json
- prompts/example1/query_contextualizer_tmpl.txt
class TemporalPredicate(BaseModel):
    field: Literal["created_at"] = Field(default="created_at")
    operator: Literal["==", "!=", ">", "<", ">=", "<="]
    value: str = Field(description='ISO 8601 datetime with timezone, e.g. "2026-03-15T00:00:00+00:00".')


class Query(BaseModel):
    query: str = Field(description="A semantically enriched, descriptive query for vector similarity search.")
    temporal_filters: list[TemporalPredicate] | None = Field(
        default=None,
        description="Date predicates on created_at, AND-combined. Null when no creation-date restriction.",
    )
Operator set diverges from production TemporalPredicate — benchmark may score predicates production would reject.
Pipeline's TemporalPredicate (openrag/components/pipeline.py lines 52-54) restricts operator to Literal[">", "<", ">=", "<="]. This benchmark schema additionally permits == and !=. A candidate model that emits, e.g., operator: "==" will parse successfully here and potentially be judged "correct", but the same output would raise ValidationError in production and be dropped by the retry/fallback path. The reported filter_detection_recall and filter_correctness are therefore optimistic vs. real-world behavior.
Recommend aligning the benchmark schema with production so the evaluation reflects what the pipeline actually accepts:
♻️ Proposed alignment
class TemporalPredicate(BaseModel):
field: Literal["created_at"] = Field(default="created_at")
- operator: Literal["==", "!=", ">", "<", ">=", "<="]
+ operator: Literal[">", "<", ">=", "<="]
  value: str = Field(description='ISO 8601 datetime with timezone, e.g. "2026-03-15T00:00:00+00:00".')

Same change applies to the benchmark copy in eval_query_decomposition.py (lines 108-111). If ==/!= support is ever added to production (e.g., for "on YEAR-MM-DD"), update both production and benchmark together — and note that the v1 prompt's judge rules already mention ==/!= in the JSON grammar hint, so the prompt template should be kept in sync as well.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
class TemporalPredicate(BaseModel):
    field: Literal["created_at"] = Field(default="created_at")
    operator: Literal[">", "<", ">=", "<="]
    value: str = Field(description='ISO 8601 datetime with timezone, e.g. "2026-03-15T00:00:00+00:00".')


class Query(BaseModel):
    query: str = Field(description="A semantically enriched, descriptive query for vector similarity search.")
    temporal_filters: list[TemporalPredicate] | None = Field(
        default=None,
        description="Date predicates on created_at, AND-combined. Null when no creation-date restriction.",
    )
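To make the scoring gap concrete, here is a tiny stand-in check using plain sets instead of pydantic validation (the operator sets are taken from the review; the helper name is illustrative):

```python
# Production accepts only range operators; the benchmark schema also
# tolerates equality operators, so it can "pass" output production rejects.
PRODUCTION_OPS = {">", "<", ">=", "<="}
BENCHMARK_OPS = PRODUCTION_OPS | {"==", "!="}

def production_accepts(op: str) -> bool:
    return op in PRODUCTION_OPS

candidate = {"field": "created_at", "operator": "==", "value": "2026-03-15T00:00:00+00:00"}
print(candidate["operator"] in BENCHMARK_OPS)     # True  — benchmark would score it
print(production_accepts(candidate["operator"]))  # False — production would drop it
```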
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/prompt_eval/eval_temporal_filter_generation.py` around lines 128 -
139, The benchmark's TemporalPredicate model allows "==" and "!=" but
production's TemporalPredicate (openrag/components/pipeline.py) only permits
">", "<", ">=", "<="; update the operator Literal in TemporalPredicate in this
file (class TemporalPredicate, field operator) to match production by removing
"==" and "!=" so it becomes Literal[">", "<", ">=", "<="], and make the
identical change to the copy in eval_query_decomposition.py so both benchmarks
mirror production validation and scoring.
def _mock_value(schema: dict, root: dict, user_text: str):
    """Recursively build a mock value that satisfies the given JSON Schema node."""
    schema = _resolve_ref(schema, root)
    # anyOf / oneOf — nullable fields (union with null) return None to avoid invalid mock values
    for combiner in ("anyOf", "oneOf"):
        if combiner in schema:
            branches = schema[combiner]
            has_null = any(s.get("type") == "null" for s in branches)
            if has_null:
                return None
            non_null = [s for s in branches if s.get("type") != "null"]
            if non_null:
                return _mock_value(non_null[0], root, user_text)
            return None
    t = schema.get("type")
    if t == "object" or "properties" in schema:
        return {k: _mock_value(v, root, user_text) for k, v in schema.get("properties", {}).items()}
    if t == "array":
        items = _resolve_ref(schema.get("items", {}), root)
        return [_mock_value(items, root, user_text)]
    if t == "string":
        return user_text
    if t in ("integer", "number"):
        return 0
    if t == "boolean":
        return False
    return None
Mock does not honor enum — will break structured-output tests for TemporalPredicate.op.
_mock_value has no branch for enum, so a constrained string like TemporalPredicate.op (e.g., >=, <=, ==) will be mocked as the user query text and fail Pydantic validation on the caller side (which uses with_structured_output(..., strict=True)). This would likely invalidate the new temporal-filter path in tests.
Also minor on the same helpers:
- Line 231: `lstrip("#/")` strips any leading `#`/`/` as a char set; use `removeprefix("#/")` for correct prefix semantics on JSON Pointer fragments.
- `_resolve_ref`/`_mock_value` have no cycle guard; a recursive `$ref` schema would recurse infinitely.
🛠️ Proposed fix: honor enum and const
def _mock_value(schema: dict, root: dict, user_text: str):
"""Recursively build a mock value that satisfies the given JSON Schema node."""
schema = _resolve_ref(schema, root)
+ # const / enum — pick a concrete allowed value so strict validation passes
+ if "const" in schema:
+ return schema["const"]
+ if "enum" in schema and schema["enum"]:
+ non_null = [v for v in schema["enum"] if v is not None]
+ return non_null[0] if non_null else None
# anyOf / oneOf — nullable fields (union with null) return None to avoid invalid mock values
     for combiner in ("anyOf", "oneOf"):

- parts = schema["$ref"].lstrip("#/").split("/")
+ parts = schema["$ref"].removeprefix("#/").split("/")🤖 Prompt for AI Agents
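A quick demonstration of the `lstrip` vs `removeprefix` difference on JSON Pointer fragments:

```python
# lstrip("#/") treats "#/" as a character SET, stripping any leading run of
# '#' or '/'. removeprefix("#/") strips only the exact two-character prefix.
ref = "#/$defs/TemporalPredicate"
print(ref.lstrip("#/"))                       # '$defs/TemporalPredicate' (works by luck here)
print("#//properties/op".lstrip("#/"))        # 'properties/op' — the extra '/' is swallowed
print("#//properties/op".removeprefix("#/"))  # '/properties/op' — prefix semantics preserved
```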
Verify each finding against the current code and only fix it if needed.
In `@tests/api_tests/api_run/mock_vllm.py` around lines 238 - 264, Handle
enum/const and prevent infinite recursion: update _mock_value to check for
"const" and "enum" (return schema["const"] if present; for "enum" return the
first enum value that matches the expected type) before the type branches so
constrained strings (e.g., TemporalPredicate.op) are mocked correctly; in
_resolve_ref replace lstrip("#/") with removeprefix("#/") to correctly strip
JSON Pointer prefixes; add a simple cycle guard by threading a visited set
(e.g., visited_ids) through _resolve_ref and _mock_value to detect and break
recursive $ref loops (return None or a safe default when a cycle is seen).
Ensure you reference the existing function names _mock_value and _resolve_ref
when making changes.
feat: temporal `created_at` filters in the RAG query pipeline

Summary
Adds LLM-driven temporal filters to retrieval so users can naturally ask time-bounded questions ("reports from last month", "meeting notes this week") and have them translated into Milvus `created_at` filters before vector search.

Changes

- `openrag/components/pipeline.py`: `TemporalPredicate` (field/op/value on `created_at`) and `Query` (text + AND-combined predicates). `SearchQueries` now returns a list of `Query` objects instead of raw strings.
- `RetrieverPipeline.retrieve_docs` takes a `Query` and derives the Milvus filter via `Query.to_milvus_filter()` — no schema change on the collection.
- `retriever.allow_filterless_fallback` (env `RETRIEVER_ALLOW_FILTERLESS_FALLBACK`, default `true`) so strict deployments can opt out.
- Query generation retries on `ValidationError`/`OutputParserException`, then falls back to the raw user query. Structured output uses `with_structured_output(..., strict=True)`.
- Prompt (`prompts/example1/query_contextualizer_tmpl.txt`, v1) adds query-decomposition rules, temporal-filter rules with explicit lower/upper-bound semantics, and a richer `{current_date}` (weekday + time) to disambiguate relative expressions. Key distinction: ingestion date → filter; document topic date → no filter.
Benchmarks

Offline harness under `benchmarks/prompt_eval/` comparing prompt v0 (pre-change) vs v1 (this PR) on the two production-candidate models. Judge: Qwen3-VL-8B-Instruct-FP8. Full results in `benchmarks/prompt_eval/Report_temporal_filter.md` and `benchmarks/prompt_eval/Report_query_decomposition.md`.

Temporal filter detection (40 cases, 20 pos / 20 neg):
Query decomposition (80 cases across D1/D2/D3):
Recommended pairing: v1 + Mistral-Small-3.1-24B — 100% filter detection, 90% filter-body correctness, and best decomposition numbers. Residual failures are narrow (e.g. spurious upper bound on "since last Monday", over-wide window on "latest") and documented in the reports.
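The half-open interval convention the v1 prompt enforces (`>= start` AND `< next_start`) can be illustrated with a stdlib sketch. The "last week" window and the sample current date are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

def last_week_window(now: datetime) -> tuple[str, str]:
    """Half-open [start, next_start) covering the previous Monday-Sunday week."""
    this_monday = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    last_monday = this_monday - timedelta(days=7)
    # Lower bound is inclusive, upper bound exclusive: no day is double-counted
    # and no midnight instant falls between adjacent windows.
    return last_monday.isoformat(), this_monday.isoformat()

now = datetime(2026, 4, 16, 10, 30, tzinfo=timezone.utc)  # a Thursday
start, end = last_week_window(now)
print(start)  # 2026-04-06T00:00:00+00:00
print(end)    # 2026-04-13T00:00:00+00:00
```

These two bounds map directly onto a `>=` predicate and a `<` predicate on `created_at`, which is why open-ended phrases like "since last Monday" should emit only the lower bound.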
Backwards compatibility
No collection/schema change. Queries with no temporal reference produce `temporal_filters=None` and hit the same retrieval path as before.

Summary by CodeRabbit
New Features
Documentation
Chores